
support qwen2 hf<->mcore ckpt converter #1290

Open. wenyujin333 wants to merge 1 commit into main from features/qwen_converter.

Conversation

@wenyujin333 commented Nov 19, 2024

usage example: examples/qwen/README.md
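
A rough sketch of how such a converter is typically driven, assuming the new loader is registered as qwen2_hf (inferred from the loader_qwen2_hf.py file name) and reusing the standard tools/checkpoint/convert.py flags from the existing Mixtral example; the parallelism values and paths below are placeholders, not the PR's documented usage:

    # HF -> MCore direction (hypothetical invocation; see examples/qwen/README.md in this PR for the real one)
    python tools/checkpoint/convert.py \
        --model-type GPT \
        --loader qwen2_hf \
        --saver mcore \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --load-dir /path/to/Qwen2-hf-checkpoint \
        --save-dir /path/to/qwen2-mcore-checkpoint \
        --tokenizer-model /path/to/Qwen2-hf-checkpoint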

@wenyujin333 marked this pull request as ready for review on November 19, 2024 05:16
@Victarry (Contributor) commented Dec 4, 2024

Hi @wenyujin333, could you please rebase the MR onto the main branch to resolve the conflicts? Thanks!

@Victarry (Contributor) left a comment

Thanks for the great work and contribution to Megatron-LM.
I think there are a few changes that would help get this MR merged into MCore:

  1. It would be very helpful for users to add a section of documentation introducing the workflow for using the HF<->MCore converter for Qwen models, similar to https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mixtral.
  2. From an MCore developer's point of view, some of the code is difficult to maintain because it is complex and not easy to follow, e.g. in saver_qwen2_hf.py. Perhaps some sections could be restructured to make the logic flow more apparent.

tools/checkpoint/loader_qwen2_hf.py: outdated review comments (resolved)
Comment on lines 336 to 519

# Dense modules
for tp_rank, model in enumerate(models[0]):
    layer = get_transformer_block(model).layers[layer_num]
    qkv_weight.append(layer.self_attention.linear_qkv.weight.data)
    dense_weight.append(layer.self_attention.linear_proj.weight.data)
    if md.linear_bias:
        qkv_bias.append(layer.self_attention.linear_qkv.bias.data)
    elif md.add_qkv_bias:
        qkv_bias.append(layer.self_attention.linear_qkv.bias.data)
    shared_expert_mlp_l0_weight.append(layer.mlp.shared_experts.linear_fc1.weight.data)
    shared_expert_mlp_l1_weight.append(layer.mlp.shared_experts.linear_fc2.weight.data)

layer = get_transformer_block(models[0][0]).layers[layer_num]
router_weight = layer.mlp.router.weight.data
shared_expert_gate_weight = layer.mlp.shared_experts.gate_weight.data

# MoE modules
num_experts_per_rank = margs.num_experts // ep_size
for ep_rank, tp_models in enumerate(models):
    for tp_rank, model in enumerate(tp_models):
        layer = get_transformer_block(model).layers[layer_num]
        for local_expert_idx in range(num_experts_per_rank):
            expert_idx = int(ep_rank * num_experts_per_rank + local_expert_idx)
            mlp_l0_weight_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc1.weight.data)
            mlp_l1_weight_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc2.weight.data)
            if md.linear_bias:
                mlp_l0_bias_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc1.bias.data)

    if md.linear_bias:
        # Get non-parallel tensors from tp_rank 0
        layer = get_transformer_block(tp_models[0])
        for local_expert_idx in range(num_experts_per_rank):
            expert_idx = int(ep_rank * num_experts_per_rank + local_expert_idx)
            mlp_l1_bias_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc2.bias.data)

mlp_l0_weight_w_list = [[] for _ in range(margs.num_experts)]
mlp_l0_weight_v_list = [[] for _ in range(margs.num_experts)]
# Concat along the tensor parallel dimension
for expert_idx in range(margs.num_experts):
    mlp_l0_weight = mlp_l0_weight_list[expert_idx]
    if md.swiglu:
        for tp_rank in range(tp_size):
            mlp_l0_weight[tp_rank] = torch.chunk(mlp_l0_weight[tp_rank], 2, dim=0)
        mlp_l0_weight_w_list[expert_idx] = torch.cat([w[0] for w in mlp_l0_weight], dim=0)
        mlp_l0_weight_v_list[expert_idx] = torch.cat([w[1] for w in mlp_l0_weight], dim=0)
    else:
        mlp_l0_weight_list[expert_idx] = torch.cat(mlp_l0_weight, dim=0)
    mlp_l1_weight_list[expert_idx] = torch.cat(mlp_l1_weight_list[expert_idx], dim=1)

# Stack along the expert parallel dimension
if md.swiglu:
    message["mlp l0 weight W"] = torch.stack(mlp_l0_weight_w_list)
    message["mlp l0 weight V"] = torch.stack(mlp_l0_weight_v_list)
    for tp_rank in range(tp_size):
        shared_expert_mlp_l0_weight[tp_rank] = torch.chunk(shared_expert_mlp_l0_weight[tp_rank], 2, dim=0)
    message["shared mlp l0 weight W"] = torch.cat([w[0] for w in shared_expert_mlp_l0_weight], dim=0)
    message["shared mlp l0 weight V"] = torch.cat([w[1] for w in shared_expert_mlp_l0_weight], dim=0)
else:
    message["mlp l0 weight"] = torch.stack(mlp_l0_weight_list)
    message["shared mlp l0 weight"] = torch.cat(shared_expert_mlp_l0_weight, dim=0)
message["shared mlp l1 weight"] = torch.cat(shared_expert_mlp_l1_weight, dim=1)
message["mlp l1 weight"] = torch.stack(mlp_l1_weight_list)

# Concat along TP and stack along EP to biases
if md.linear_bias:
    mlp_l0_bias_w_list = [[] for _ in range(margs.num_experts)]
    mlp_l0_bias_v_list = [[] for _ in range(margs.num_experts)]
    # Concat along the tensor parallel dimension
    for expert_idx in range(margs.num_experts):
        mlp_l0_bias = mlp_l0_bias_list[expert_idx]
        if md.swiglu:
            for tp_rank in range(tp_size):
                mlp_l0_bias[tp_rank] = torch.chunk(mlp_l0_bias[tp_rank], 2, dim=0)
            mlp_l0_bias_w_list[expert_idx] = torch.cat([w[0] for w in mlp_l0_bias], dim=0)
            mlp_l0_bias_v_list[expert_idx] = torch.cat([w[1] for w in mlp_l0_bias], dim=0)
        else:
            mlp_l0_bias_list[expert_idx] = torch.cat(mlp_l0_bias, dim=0)
        assert len(mlp_l1_bias_list[expert_idx]) == 1
        mlp_l1_bias_list[expert_idx] = mlp_l1_bias_list[expert_idx][0]

    # Stack along the expert parallel dimension
    if md.swiglu:
        message["mlp l0 bias W"] = torch.stack(mlp_l0_bias_w_list)
        message["mlp l0 bias V"] = torch.stack(mlp_l0_bias_v_list)
    else:
        message["mlp l0 bias"] = torch.stack(mlp_l0_bias_list)
    message["mlp l1 bias"] = torch.stack(mlp_l1_bias_list)

# Simple concat of the rest
message["qkv weight"] = torch.cat(qkv_weight, dim=0)
message["dense weight"] = torch.cat(dense_weight, dim=1)
if md.linear_bias:
    message["qkv bias"] = torch.cat(qkv_bias, dim=0)
elif md.add_qkv_bias:
    message["qkv bias"] = torch.cat(qkv_bias, dim=0)

# Do nothing to router
message["router weight"] = router_weight
message["shared gate weight"] = shared_expert_gate_weight
Contributor:

Could you refactor this block of code to give it a clearer structure and deduplicate the repeated logic, for better maintainability?
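
For illustration only, one shape such a refactor could take is to fold the repeated swiglu-chunk / TP-concat / EP-stack pattern for expert weights and biases into a small helper. This is a sketch with hypothetical names, not the reviewer's concrete proposal nor the author's eventual change:

    import torch

    def concat_over_tp(shards, dim, swiglu=False):
        """Concatenate the per-TP-rank shards of one expert tensor.

        If swiglu is True, each shard holds [W; V] stacked along dim 0, so split
        every shard in two and return (W, V), each concatenated over TP ranks;
        otherwise return a single tensor concatenated along `dim`.
        """
        if swiglu:
            chunks = [torch.chunk(s, 2, dim=0) for s in shards]
            w = torch.cat([c[0] for c in chunks], dim=dim)
            v = torch.cat([c[1] for c in chunks], dim=dim)
            return w, v
        return torch.cat(shards, dim=dim)

    def pack_experts(message, key, per_expert_shards, dim, swiglu=False):
        """Concat each expert over TP, then stack all experts along a new EP dim."""
        packed = [concat_over_tp(shards, dim, swiglu) for shards in per_expert_shards]
        if swiglu:
            message[f"{key} W"] = torch.stack([p[0] for p in packed])
            message[f"{key} V"] = torch.stack([p[1] for p in packed])
        else:
            message[key] = torch.stack(packed)

    # Usage mirroring the block quoted above:
    #   pack_experts(message, "mlp l0 weight", mlp_l0_weight_list, dim=0, swiglu=md.swiglu)
    #   pack_experts(message, "mlp l1 weight", mlp_l1_weight_list, dim=1)
    #   if md.linear_bias:
    #       pack_experts(message, "mlp l0 bias", mlp_l0_bias_list, dim=0, swiglu=md.swiglu)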

Author (@wenyujin333):

done

tools/checkpoint/saver_qwen2_hf.py: outdated review comment (resolved)
@wenyujin333 force-pushed the features/qwen_converter branch 2 times, most recently from 0886d56 to 87dd51b on December 9, 2024 08:09
@wenyujin333 force-pushed the features/qwen_converter branch from 87dd51b to 2a07758 on December 10, 2024 05:09
@jon-barker (Collaborator) commented Dec 12, 2024

Hi, thanks for your contribution. We actually already have HF->MCore conversion for non-MoE Qwen 2 and 2.5, but it's a little hidden since it lives here: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/checkpoint/loader_llama_mistral.py

The usage is currently documented in examples/multimodal/nvlm, but we should document it in the main Megatron docs.

Perhaps you could add the MoE support to what we already have, and then we can look to merge your contribution.
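
For reference, the existing non-MoE path is driven through the same tools/checkpoint/convert.py entry point with the llama_mistral loader. The sketch below is hypothetical; in particular, the --checkpoint-type and --model-size values accepted for Qwen are defined in loader_llama_mistral.py and the nvlm scripts, not shown in this thread, and the paths are placeholders:

    # Hypothetical invocation of the existing converter; the --checkpoint-type and
    # --model-size values are placeholders -- check loader_llama_mistral.py and the
    # examples/multimodal/nvlm scripts for the values actually accepted for Qwen.
    python tools/checkpoint/convert.py \
        --model-type GPT \
        --loader llama_mistral \
        --saver mcore \
        --checkpoint-type hf \
        --model-size qwen2.5-7B \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --load-dir /path/to/Qwen2.5-7B-hf \
        --save-dir /path/to/qwen2.5-mcore \
        --tokenizer-model /path/to/Qwen2.5-7B-hf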
